A Guide to Encoding Texts for Natural Language Processing

tutorial
Author

Adrian Leung

Published

January 17, 2025

Source: Google

Introduction

Natural Language Processing (NLP) is a fascinating field of Machine Learning that focuses on enabling machines to understand, interpret, and generate human language. From facilitating translation tools like Google Translate to powering voice assistants like Siri and Alexa, NLP’s influence on the current technological landscape is substantial. With the ever-growing interest in artificial intelligence, many aspiring engineers and scientists are looking to venture into the NLP field. However, before we learn to perform fancy tasks such as text generation and summarization, we must start from the fundamentals and ask ourselves a question:

How can we bridge the gap between human communication and machine processing?

One problem that arises from this question is that computers do not understand language the way humans do. They are not wired to comprehend words and write essays like we do. Instead, they operate on numbers: NLP models are driven by mathematical algorithms and formulae. Thus, encoding text into numerical representations is the key to computers learning human language. By converting words, sentences, or entire documents into numbers, NLP models can perform a wide range of tasks like analyzing patterns, extracting meanings, and generating responses.

This blog will introduce and guide you through different methods and tools to encode texts, thus providing you a gateway to using NLP models.

Challenges of Textual Data

Before we learn about different ways to encode texts, we need to acknowledge the challenges associated with the intricacy of human language. Language is messy and unpredictable. Although there are sets of grammatical rules that govern a language, humans are prone to making mistakes yet are still able to convey their messages. For example, “How is you doin” is grammatically incorrect, but we know it means “How are you doing”. Thus, language does not strictly follow a fixed set of rules, contrary to how computers operate.

Moreover, not all words carry the same amount of meaning. Function words such as the auxiliary verbs “is” and “am” contribute little to the message a sentence conveys; they mostly serve to satisfy grammatical rules. There is also a hierarchy of meanings in a sentence: certain words matter more than others. Consider the sentence “We are happy”. Although both “we” and “happy” play a role, “happy” is the more important word as it names the key emotion.

To complicate matters more, a word can have different meanings depending on the context. Even more confusingly, some words can have nearly opposite meanings. For example, the word “left” in the sentence “We just left” means departed. However, it means remaining in the sentence “We are the only ones left”. This shows that context can drastically alter the meaning of the same word.

Context matters!!! (Source: Kamala Harris)

Hence, making computers comprehend language like we do is far from a simple task. Encoding words with numerical representations is a work of art as it determines how well a model can understand us.

Approaches to Encoding Texts

This section will cover different approaches to encoding texts including traditional methods like Bag-of-Words (BoW) and TF-IDF, word embeddings, and contextualized embeddings.

Traditional Methods

Bag-of-Words (BoW)

BoW is one of the most popular encoding methods. It takes the unique words across all input documents and, for each document, assigns every word a number based on its count or presence in that document.

Consider the example below:

Unique words in all documents: [‘the’, ‘bird’, ‘is’, ‘cat’, ‘and’, ‘dog’, ‘hate’, ‘each’, ‘no’, ‘other’]

We pick one of the documents for our first BoW representation:

Document: “The cat and the cat hate each other.”

When encoding each word with its count, the BoW model transforms the document into the representation in Table 1. Since ‘the’ and ‘cat’ each appear twice in the document above, their numerical representations in this document are both 2.

the  bird  is  cat  and  dog  hate  each  no  other
  2     0   0    2    1    0     1     1   0      1
Table 1: BoW representations using word counts

When measuring each word by its presence instead, as in Table 2, BoW uses the binary values 0 and 1 to represent each word, where 0 implies absence and 1 implies presence. Note that the words ‘the’ and ‘cat’ are now represented by 1 instead of 2 since we are using binary representations.

the  bird  is  cat  and  dog  hate  each  no  other
  1     0   0    1    1    0     1     1   0      1
Table 2: BoW representations using binary indicators
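
To make the two tables concrete, here is a minimal sketch that rebuilds both rows by hand using only the Python standard library. The tokenization rule (lowercase, letters only) is an assumption made for this illustration, not something the BoW method prescribes.

```python
from collections import Counter
import re

# Vocabulary taken from the example above, in the same column order
vocabulary = ['the', 'bird', 'is', 'cat', 'and', 'dog', 'hate', 'each', 'no', 'other']
document = "The cat and the cat hate each other."

# Assumed tokenization: lowercase, then keep runs of letters only
tokens = re.findall(r"[a-z]+", document.lower())
counts = Counter(tokens)

# Count-based representation (Table 1)
count_vector = [counts[word] for word in vocabulary]
# Presence-based representation (Table 2)
binary_vector = [1 if counts[word] > 0 else 0 for word in vocabulary]

print(count_vector)   # [2, 0, 0, 2, 1, 0, 1, 1, 0, 1]
print(binary_vector)  # [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]
```

The only difference between the two variants is whether we keep the raw count or clip it to 1.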

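To extract BoW representations in Python [1], we can leverage the CountVectorizer class from the scikit-learn package [2]. Table 3 demonstrates using CountVectorizer for BoW extraction by word count with the same example. By default, CountVectorizer lowercases the text and orders the columns alphabetically, which is why the column order in Table 3 differs from Table 1.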

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Using above example
documents = [
    'The cat and the cat hate each other.',
    'The dog is the bird.',
    'No, the bird and cat hate other dog.'
]

bow = CountVectorizer()
X = bow.fit_transform(documents)
bow_df = pd.DataFrame(
    X.toarray(), columns=bow.get_feature_names_out(), index=documents
)
bow_df
                                      and  bird  cat  dog  each  hate  is  no  other  the
The cat and the cat hate each other.    1     0    2    0     1     1   0   0      1    2
The dog is the bird.                    0     1    0    1     0     0   1   0      0    2
No, the bird and cat hate other dog.    1     1    1    1     0     1   0   1      1    1
Table 3: BoW results from CountVectorizer

Although the BoW method is intuitive and easy to interpret, it is far from a perfect model because it discards the word order of the original document. It ignores how words combine into meaningful phrases and how their meanings change with context.

TF-IDF

TF-IDF, which stands for term frequency-inverse document frequency, is another popular method to encode text. It measures how relevant a word is to a document within a collection of documents. The computation can be broken down into two parts: term frequency and inverse document frequency.

Term Frequency (TF)
The term frequency is the count of a given word \(w\) in a given document \(d\), divided by the total number of words in document \(d\).

\[TF = \frac{\text{Number of occurrences of word $w$ in document $d$}}{\text{Total number of words in document $d$}}\]

Inverse Document Frequency (IDF)
The inverse document frequency penalizes words that are too common across all documents. For example, auxiliary verbs like ‘is’ are weighted less as a result. In turn, this boosts rarer words that possibly carry more meaning and importance.

\[IDF = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing word $w$}}\right)\]

Combining both TF and IDF
To obtain the TF-IDF representation, we multiply the two terms:

\[TF\text{-}IDF = TF \times IDF\]
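
As a worked example, the formulas above can be applied by hand to the three example documents from the BoW section. This is a minimal sketch: the tokenization (lowercase, punctuation removed) and the use of the natural logarithm are assumptions, since the formula does not fix the base of the log.

```python
import math

# Tokenized version of the running example (lowercased, punctuation removed)
documents = [
    "the cat and the cat hate each other".split(),
    "the dog is the bird".split(),
    "no the bird and cat hate other dog".split(),
]

def tf(word, doc):
    # Count of the word divided by the document length
    return doc.count(word) / len(doc)

def idf(word, docs):
    # Natural log is an assumption; the word must occur in at least one document
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

first = documents[0]
print(round(tf_idf("cat", first, documents), 4))   # 0.1014
print(round(tf_idf("each", first, documents), 4))  # 0.1373
print(tf_idf("the", first, documents))             # 0.0
```

Even though ‘the’ and ‘cat’ both occur twice in the first document, ‘the’ appears in all three documents, so its IDF is log(1) = 0 and its score vanishes. The word ‘each’ occurs only once but in a single document, so it ends up with the highest score of the three.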

Now, let’s revisit our example with TF-IDF in Python [1]. Luckily, the scikit-learn package [2] also has a class for TF-IDF called TfidfVectorizer. As shown in the first row of Table 4, the TF-IDF representation for ‘the’ is smaller than that for ‘cat’ even though they both have the same BoW representation in Table 3. This is the result of the IDF penalty, since ‘the’ appears in every document. Note that TfidfVectorizer by default uses a smoothed variant of IDF and normalizes each row to unit length, so the exact values differ from what the formulas above would give.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Using the same example
documents = [
    'The cat and the cat hate each other.',
    'The dog is the bird.',
    'No, the bird and cat hate other dog.'
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(documents)
tfidf_df = pd.DataFrame(
    X.toarray(), columns=tfidf.get_feature_names_out().tolist(), index=documents
)
tfidf_df
                                           and      bird       cat       dog     each      hate        is        no     other       the
The cat and the cat hate each other.  0.299594  0.000000  0.599187  0.000000  0.39393  0.299594  0.000000  0.000000  0.299594  0.465322
The dog is the bird.                  0.000000  0.403525  0.000000  0.403525  0.00000  0.000000  0.530587  0.000000  0.000000  0.626747
No, the bird and cat hate other dog.  0.346438  0.346438  0.346438  0.346438  0.00000  0.346438  0.000000  0.455524  0.346438  0.269040
Table 4: TF-IDF results from TfidfVectorizer
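
The numbers in Table 4 do not match the plain TF × IDF formula because of TfidfVectorizer’s default settings. As far as the scikit-learn documentation describes them, the defaults use raw counts for TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then scale each row to unit Euclidean length. The sketch below reproduces two values from the first row by hand under those assumptions; the document frequencies are read off Table 3.

```python
import math
from collections import Counter

first_doc = "the cat and the cat hate each other".split()
n_docs = 3
# Number of documents containing each word, read off Table 3
doc_freq = {"the": 3, "cat": 2, "and": 2, "hate": 2, "other": 2, "each": 1}

def smooth_idf(df, n):
    # Smoothed IDF, assumed to match scikit-learn's default settings
    return math.log((1 + n) / (1 + df)) + 1

# Raw count times smoothed IDF for every word in the document
counts = Counter(first_doc)
raw = {w: c * smooth_idf(doc_freq[w], n_docs) for w, c in counts.items()}

# L2 normalization: divide by the Euclidean length of the row
norm = math.sqrt(sum(v * v for v in raw.values()))
weights = {w: v / norm for w, v in raw.items()}

print(round(weights["cat"], 4))  # 0.5992
print(round(weights["the"], 4))  # 0.4653
```

Because of the "+ 1" in the smoothed IDF, a word that appears in every document (like ‘the’) is down-weighted but never reduced to exactly zero.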

TF-IDF is a step up from BoW as it recognizes what makes a word important in a document. However, similar to BoW, it also disregards the order and context of words. Thus, we need alternatives that can compute an even better representation for words.

Word Embeddings

This is where word embeddings come into play. Unlike the traditional methods above, which assign each word a single count or weight, word embeddings encode each word as a dense vector (from Linear Algebra!). Through vector representations, they can capture the relationships between words by expressing their similarity numerically. Mathematically, the similarity between words is measured by how close their embeddings are in the vector space. Figure 1 shows an example of visualizing word embeddings in a 2-dimensional space. As you can see, words that are similar in meaning or context are clustered together. This is the art of word embeddings!

Figure 1: Visualization of word embeddings (Source: Ruben Winastwan)
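
To illustrate what "close in the vector space" can mean, here is a small sketch using cosine similarity, one common choice of closeness measure (others, such as Euclidean distance, are also used). The 2-dimensional vectors are made up by hand for illustration and do not come from any trained model.

```python
import numpy as np

# Made-up 2-dimensional embeddings, chosen by hand for illustration only
embeddings = {
    "cat": np.array([0.9, 0.8]),
    "dog": np.array([0.8, 0.9]),
    "car": np.array([-0.7, 0.2]),
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1 means same direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])

print(round(cat_dog, 3), round(cat_car, 3))
print(cat_dog > cat_car)  # True
```

In this toy setup ‘cat’ and ‘dog’ point in nearly the same direction, so their similarity is close to 1, while ‘cat’ and ‘car’ point in different directions and score much lower.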

Word2Vec

One common tool to obtain word embeddings is Word2Vec [3]. It computes the embeddings by training a shallow two-layer neural network. There are two approaches that Word2Vec uses to obtain these embeddings.

Continuous BoW (CBoW)

CBoW is a prediction algorithm where the neural network aims to predict a target word based on its surrounding context in a document. Simply put, this is analogous to filling in the blank in a sentence.

Consider the sentence “My cute puppy is barking”. The model iterates over this sentence and hides one word in each iteration. For example, as shown in Figure 2, the model omits the word ‘puppy’ from the sentence and trains the neural network to guess the word ‘puppy’ from the remaining words.

Figure 2: Illustration of CBoW algorithm
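
The fill-in-the-blank idea can be sketched as a function that generates (context, target) training pairs. This is only an illustration of how the training data is framed, not how Word2Vec is implemented internally; the function name cbow_pairs and the window size of 2 are assumptions made for this example.

```python
sentence = "my cute puppy is barking".split()

def cbow_pairs(tokens, window=2):
    # For each position, the target is the word itself and the
    # context is up to `window` words on each side of it
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs

for context, target in cbow_pairs(sentence):
    print(context, "->", target)
# The third pair is (['my', 'cute', 'is', 'barking'], 'puppy')
```

Each pair gives the network a set of context words as input and the hidden word as the label to predict.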

Skipgram

Skipgram is the complete opposite of CBoW. Instead of predicting the missing word from a given context, skipgram predicts the surrounding context from a given word. Using the same example, as shown in Figure 3, the model will try to guess the words surrounding ‘puppy’.

Figure 3: Illustration of Skipgram algorithm
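
Skipgram training data can be framed in a similar way, with the roles reversed: the input is a single word and the label is one of its neighbours. As before, this is a sketch of the data framing only; the function name skipgram_pairs and the window size of 2 are assumptions made for this example.

```python
def skipgram_pairs(tokens, window=2):
    # For each center word, emit one (center, neighbour) pair
    # for every word within `window` positions of it
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("my cute puppy is barking".split())
puppy_pairs = [p for p in pairs if p[0] == "puppy"]
print(puppy_pairs)
# [('puppy', 'my'), ('puppy', 'cute'), ('puppy', 'is'), ('puppy', 'barking')]
```

One sentence therefore yields many small prediction tasks, one per (word, neighbour) pair.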

After multiple iterations in the training process, Word2Vec will use the learned weights in the neural network from either of the approaches to construct the word embeddings for each word.

Applying Word2Vec in Python [1] is made possible with the package Gensim [4]. As seen in Listing 1, Gensim’s Word2Vec takes in a list of lists of tokens to generate the word embeddings. Note that the sg argument lets you choose between CBoW and skipgram, where 0 and 1 correspond to CBoW and skipgram respectively. The argument min_count tells the model to ignore words that appear fewer times than this minimum. After training Word2Vec on our sample text, we can take a quick look at what the word embedding looks like for the word ‘technology’ in Listing 1.

Listing 1: This code demonstrates extracting word embedding from the word ‘technology’
from gensim.models import Word2Vec

# Sample text generated by ChatGPT
sample_text = [
    "The advancement of technology has transformed the way we communicate and interact with the world.",
    "Artificial intelligence is increasingly being used in healthcare, education, and other industries to enhance efficiency.",
    "People often gather in coffee shops to discuss ideas, share stories, and enjoy a sense of community.",
    "Self-driving cars and smart home devices are examples of how technology is becoming a part of our everyday lives.",
    "Art galleries and cultural festivals are popular spots for people to explore creativity and connect with others.",
    "The integration of AI in the workplace has sparked debates about its impact on jobs and productivity.",
    "Reading books and attending literary events remain cherished activities in the digital age.",
    "Many cities are blending technology with traditional practices to create unique and thriving environments.",
    "The use of virtual reality in gaming and training has opened new possibilities for immersive experiences.",
    "Social media platforms have changed the way we form relationships and share information globally."
]
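# Whitespace tokenization: this generates a list of lists of tokens
# (case and punctuation are kept, so 'world.' and 'world' would differ)
sample_sentences = [sent.split() for sent in sample_text]

# min_count=1 keeps every word; sg=1 selects skipgram (sg=0 would be CBoW)
w2v = Word2Vec(sample_sentences, min_count=1, sg=1)
print(w2v.wv['technology'])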
[ 8.1346482e-03 -4.3696621e-03 -1.0951435e-03  1.0827162e-03
 -1.6076697e-04  1.0314664e-03  6.1621480e-03  1.0513653e-04
 -3.3200562e-03 -1.6226495e-03  5.8745840e-03  1.3788766e-03
 -6.8573974e-04  9.4051417e-03 -4.8837205e-03 -9.1557507e-04
  9.1908397e-03  6.7008133e-03  1.4975395e-03 -9.0892995e-03
  1.2369623e-03 -2.2766890e-03  9.4154952e-03  1.1043868e-03
  1.5170779e-03  2.3703994e-03 -1.9285623e-03 -4.9641491e-03
  1.0452364e-04 -2.0255600e-03  6.6222828e-03  8.9008864e-03
 -5.9560389e-04  2.8281522e-03 -6.1490987e-03  1.7546985e-03
 -6.8589435e-03 -8.6309118e-03 -5.9207552e-03 -9.0170074e-03
  7.2529181e-03 -5.8431132e-03  8.1692915e-03 -7.1991798e-03
  3.4998127e-03  9.6219182e-03 -7.8216838e-03 -9.9756923e-03
 -4.2250603e-03 -2.6117193e-03 -2.6378274e-04 -8.8602239e-03
 -8.5995235e-03  2.7603817e-03 -8.2284901e-03 -9.0225162e-03
 -2.3512202e-03 -8.6695738e-03 -7.1790209e-03 -8.3399629e-03
 -2.7452479e-04 -4.5728176e-03  6.6562551e-03  1.5371529e-03
 -3.3772779e-03  6.1904443e-03 -5.9688864e-03 -4.5542065e-03
 -7.3182448e-03 -4.1977805e-03 -1.7964148e-03  6.5716160e-03
 -2.7138642e-03  4.9592862e-03  6.9808913e-03 -7.4309111e-03
  4.5860992e-03  6.1515258e-03 -2.9300000e-03  6.6001783e-03
  6.0655614e-03 -6.4467844e-03 -6.8669897e-03  2.5851957e-03
 -1.7241427e-03 -6.1018453e-03  9.5864786e-03 -5.1106275e-03
 -6.4399107e-03 -4.0861694e-05 -2.5958647e-03  5.0307182e-04
 -3.4884498e-03 -3.8938067e-04 -6.8298483e-04  8.8520558e-04
  8.1980843e-03 -5.7321084e-03 -1.6760164e-03  5.5565243e-03]

As it turns out, the word embedding for ‘technology’ is a high-dimensional vector (100 dimensions, Gensim’s default size). However, the values are much harder to interpret than the representations we saw earlier. Let’s try to translate these embeddings into more comprehensible results.

As mentioned before, word embeddings are powerful at capturing similarities between words. We put this to the test in Listing 2, where we ask Word2Vec what the most similar word to ‘technology’ is. The result shows that ‘advancement’ is the closest word choice. This is a reasonable pick since we often use the phrase ‘technological advancement’ when describing new technological milestones, although with a corpus this small the similarity score is modest and the result can vary between runs.

Listing 2: This code shows what Word2Vec thinks the most similar word to ‘technology’ is.
print(w2v.wv.most_similar('technology')[0])
('advancement', 0.3503554165363312)

Word2Vec is a big improvement over traditional methods. Nevertheless, it still has flaws, such as failing to handle unknown words. Since Word2Vec only learns embeddings for the words in its training corpus, its linguistic knowledge is limited to that corpus, and any word it has never seen has no embedding at all. Unfortunately, language changes over time as new words continue to pop up in dictionaries every year. Thus, it is only a matter of time before a trained Word2Vec model becomes outdated. Another limitation of Word2Vec is that it struggles to differentiate between words with multiple meanings. Since Word2Vec generates a single embedding for each unique word, it cannot represent the fact that a word can carry separate meanings. This is especially problematic when a word has opposite meanings in different contexts.

Contextualized Embeddings

Similar to word embeddings, contextualized embeddings also encode texts into high-dimensional vectors. They build on top of the existing framework of word embeddings by conditioning each word on its context. When computing the contextualized embedding of a word, the model includes the neighbouring words in the calculation. As such, the same word appearing in different contexts will have different embeddings. This addresses the limitations of word embeddings, as the model learns each word based on the context surrounding it.

While there are several architectures, such as ELMo and GPT-2, that produce contextualized embeddings, we will focus on the BERT model in this section.

Bert from Sesame Street. Coincidentally, several popular contextualized embedding models are named after Sesame Street characters (Source: Sesame Street)

BERT

BERT [5], which stands for Bidirectional Encoder Representations from Transformers, is a transformer-based encoder model developed in 2018 by researchers at Google [6]. It is bidirectional in the sense that it captures both the left and right contexts of a given word. For example, consider the sentence “You exist in the context of all in which you live” and the target word “context”. The bidirectional nature of BERT

References

1. Van Rossum G, Drake Jr FL (1995) Python reference manual. Centrum voor Wiskunde en Informatica Amsterdam
2. Pedregosa F, Varoquaux G, Gramfort A, et al (2011) Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830
3. Mikolov T (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781
4. Rehurek R, Sojka P (2011) Gensim–python framework for vector space modelling. NLP Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic 3(2)
5.
6.